Method for Removing and/or Detecting Nucleic Acids Having Mismatched Nucleotides

ABSTRACT

Provided herein, among other things, are various in vitro methods that involve cleaving dsDNA molecules that comprise a mismatched nucleotide using EndoMS. In some embodiments, the method may comprise ligating a T-tailed double-stranded adapter to A-tailed double-stranded fragments of nucleic acid to produce ligation products that comprise adapter-ligated fragments and double-stranded adapter dimers that comprise a T:T mismatch at the ligation junction and cleaving both strands of the adapter dimers using EndoMS.

BACKGROUND

During next generation DNA sequencing, the DNA molecules to be sequenced are often sheared into fragments and repaired, and adapters of known sequence are ligated to the resulting insert library. A key step in DNA sequencing sample preparation is ligation of oligonucleotide adapters to a population of DNA fragments. The DNA fragments are typically 3′ tailed with dA to prevent self-ligation of library DNA. Adapters are designed having a 3′-T overhang to preferentially ligate to the 3′-dA of the fragments. During ligation, the adapters are in excess over the fragments in order to maximize ligation efficiency. Typically, a molar ratio of 10:1 adapter:insert is recommended to maximize ligation efficiency (Head, et al., Biotechniques, 2014, 56, 61-68). However, use of higher adapter:fragment ratios leads to “adapter dimers” that result from self-ligation of the adapters directly to each other rather than to a library insert sequence.

The problem of adapter dimers is magnified if the input sample amount is low or of poor quality such as DNA or cDNA from biopsies, FFPE, tissues or single cells (Head 2014). With low DNA input or with poor quality input, the ratio of adapter:insert is greater than 10:1 (for example 100:1, 1000:1 or 10,000:1) and leads to more adapter dimers and low library conversion efficiency. Adapter dimers are also problematic during small RNA library preparation and form the majority of ligated DNA products (Shore, et al., PLoS One, 2016, 11, e0167009).
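The adapter:insert molar ratios discussed above can be estimated from the input mass and the fragment length. The following is a minimal sketch and is not part of the described method. It assumes an average mass of about 660 g/mol per base pair of dsDNA, and the function names and the example amounts (5 pmol of adapter, 300 bp inserts) are hypothetical.

```python
# Assumption: average molar mass of one dsDNA base pair, ~660 g/mol.
AVG_BP_MASS_G_PER_MOL = 660.0


def dsdna_pmol(mass_ng: float, length_bp: float) -> float:
    """Convert a mass of dsDNA (ng) into an amount of molecules (pmol)."""
    if mass_ng < 0 or length_bp <= 0:
        raise ValueError("mass must be >= 0 and length must be > 0")
    # ng divided by g/mol gives nmol; multiply by 1000 to express it in pmol.
    return mass_ng * 1000.0 / (length_bp * AVG_BP_MASS_G_PER_MOL)


def adapter_to_insert_ratio(adapter_pmol: float, insert_ng: float, insert_bp: float) -> float:
    """Return the molar ratio of adapter to insert for a ligation reaction."""
    insert_pmol = dsdna_pmol(insert_ng, insert_bp)
    if insert_pmol == 0:
        raise ValueError("insert amount must be greater than zero")
    return adapter_pmol / insert_pmol


if __name__ == "__main__":
    # Hypothetical example: fixed adapter amount, decreasing insert input.
    for insert_ng in (100.0, 10.0, 1.0):
        ratio = adapter_to_insert_ratio(5.0, insert_ng, 300.0)
        print(f"{insert_ng:g} ng insert -> adapter:insert ~ {ratio:.0f}:1")
```

Under these assumptions, holding the adapter amount fixed while lowering the input ten-fold raises the adapter:insert ratio ten-fold, which is the situation described for low input samples.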

Once formed, adapter dimers are more efficiently amplified during PCR than libraries containing longer inserts. Due to their short size, adapter dimers form clusters on sequencing flow cells very efficiently. However, because adapter dimers contain no insert, sequencing the adapter dimer yields no useful data. In an Illumina sequencing run, a low level (5%) of adapter dimer contamination can result in 60% of sequencing reads coming from adapter dimers. Adapter dimer contamination therefore lowers the DNA sequencing quality and output and increases the cost of sequencing.

To minimize the formation and accumulation of DNA adapter dimers during sample preparation, several strategies have been developed to separate adapter dimers from libraries with inserts ligated to adapters. For example, adapter dimers can be removed using beads. Alternatively, ligated DNA libraries can be separated from adapter dimers by gel electrophoresis, and the band corresponding to the library can be cut out and purified from the gel. Other methods include the use of blocking locked nucleic acids (LNAs) to reduce adapter dimer ligation (Kawano, et al., Biotechniques, 2010, 49, 751-755). However, these methods lead to overall sample loss and limit automation of library construction (Shore, et al., Methods in Molecular Biology, 2018, 1712, 145-161).

US 2014/0356867 describes cleavage of adapter dimers using Cas9. However, a major source of adapter dimers, namely adapter dimers that contain a T-T mismatch, is not described in this publication. Moreover, not only may the guide RNAs hybridize to genomic sequences to produce undesirable off target cleavage, but introduction of guide RNA molecules into an amplification and/or sequencing reaction can potentially cause additional artefacts. In another example, WO 2013/188037 describes a method by which CRISPR stem loops are engineered into RNA adapters so that dimers of those adaptors can be recognized by Cas6 and cleaved prior to reverse transcription. WO 2013/188037 makes no mention of adapter dimers that contain a T-T mismatch, or DNA adaptors.

In view of the above, methods are needed to eliminate adapter dimers to enable higher quality DNA sequencing especially at low input.

Reducing or eliminating adapter dimers would enable higher library conversion efficiency of both normal and low input libraries due to 1) higher ratios of adapter:insert that increase ligation efficiency and 2) higher PCR efficiency of libraries in the absence of adapter dimers, resulting in higher quality and yield of DNA sequencing.

SUMMARY

Many of the adapters used for the construction of next generation sequencing libraries have a single nucleotide 3′ overhang. Such adapters, in theory, are only capable of ligating to other molecules that contain a single nucleotide 3′ A overhang providing the adaptor overhang is complementary to A. Nucleotides complementary to A are referred to as T. Adaptors should not ligate to other molecules that contain a single nucleotide 3′ T overhang. Throughout the present specification and claims, “T” includes analogs such as U and modified Ts and modified Us.

It has been found that the adapter dimers created during next generation sequencing library construction often contain a T:T mismatch at the ligation junction. These molecules can be efficiently eliminated using an EndoMS as described herein. EndoMS specifically cleaves both strands of a double-stranded DNA (dsDNA) only if it contains a mismatch. In some embodiments, the EndoMS treatment step may additionally remove molecules that contain damaged nucleotides from the sample.

A variety of methods and kits are described herein. In some embodiments, the method for cleaving adapter dimers produced during a ligation reaction may include: (a) ligating a T-tailed double-stranded adapter to A-tailed double-stranded fragments of nucleic acid to produce ligation products that comprise: (i) adapter-ligated double-stranded nucleic acid fragments and (ii) double-stranded adapter dimers that comprise a T:T mismatch at the ligation junction; and (b) cleaving both strands of the adapter dimers using EndoMS.

A method for cleaving a nucleic acid is also provided. In some embodiments, the method may include: hybridizing the nucleic acid with an oligonucleotide that is not perfectly complementary to a target sequence within the nucleic acid, to produce a duplex that comprises one or more single nucleotide mismatches; and treating the duplex with EndoMS, thereby cleaving the nucleic acid at the target sequence containing the single nucleotide mismatch.

A method for identifying a single mismatched nucleotide in a double-stranded nucleic acid is also provided. In these embodiments, the method may comprise: (a) reacting a sample comprising the double-stranded nucleic acid with an EndoMS to produce a reaction product, wherein the EndoMS cleaves both strands of the nucleic acid only if it contains a mismatch; (b) subjecting the reaction product of (a) to amplification under conditions that amplify the double-stranded nucleic acid if it is uncleaved; and (c) detecting the presence of an amplification product, wherein the presence of the product indicates that the double-stranded nucleic acid does not have a mismatched nucleotide and the absence of a product indicates that the double-stranded nucleic acid has a mismatched nucleotide.

Other embodiments may include targeting mismatches in purified genomic DNA using EndoMS. Other embodiments may include targeting mismatches in vivo in nucleic acids in eukaryotic cells using bacterial or archaeal EndoMS genes delivered by transformation using extrachromosomal DNA. Alternatively, EndoMS proteins may be delivered in vivo in the eukaryotic cells using liposomes or various transport proteins known in the art.

BRIEF DESCRIPTION OF THE FIGURES

The skilled artisan will understand that the drawings, described below, are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.

FIG. 1A-1B EndoMS eliminates double-stranded oligonucleotides that have a mismatch.

FIG. 1A shows the sequences of the double-stranded oligonucleotides used.

FIG. 1B shows the results obtained when EndoMS was incubated with either the T:A or T:T substrate for various times (0-60 minutes), after which the reaction was halted with 10 mM EDTA. Reactions were separated and analyzed by capillary electrophoresis. EndoMS had no activity on matched T:A substrates but cleaved two nucleotides (nt) 5′ to a T:T mismatch resulting in a smaller 9 nt product.

This data shows that EndoMS can efficiently eliminate double-stranded oligonucleotides that contain a mismatch.

FIG. 2A-2B shows EndoMS cleaves a T:T adapter dimer mismatch. Adapters were ligated, and adapter dimers having a T:T mismatch at the ligated adapter junction were produced.

FIG. 2A shows the structures of the adapter dimers produced by ligating T-tailed hairpin oligonucleotides together.

FIG. 2B is a gel showing the analysis of reaction products. Fragments were separated by 15% TBE-Urea gel electrophoresis. Lane 1 shows uncleaved adapter dimers. A fragment of 130 nt was observed. Lane 2 shows adapter dimers treated with EndoMS. Fragments of 65 nt were observed.

This data shows that adapters that contain a T overhang can ligate to each other to produce a dimer that contains a T:T mismatch. This data also shows that EndoMS can cleave those adapter dimers.

FIG. 3A-3C shows EndoMS depletes adapter dimers in next generation sequencing libraries. Human genomic DNA (10 ng) was sheared into 300 nt fragments, end repaired, dA-tailed then purified using SPRI® beads (Beckman Coulter, Brea, Calif.). Adapters (15 μM) were ligated to the insert library and purified using SPRI beads. An aliquot was treated with 0.3, 1.25 or 5 pmol EndoMS (+EndoMS: black trace) or water (−EndoMS: gray trace) and incubated for 1 hour at 37° C. Reactions were then PCR amplified for 10 cycles using Index Primer 1 and Universal Primer. Reaction products were separated and analyzed using the Agilent Bioanalyzer® (Agilent Technologies, Santa Clara, Calif.). EndoMS treatment (black) depletes adapter dimer formation.

FIG. 3A shows results obtained using 0.3 pmol EndoMS.

FIG. 3B shows results obtained using 1.25 pmol EndoMS.

FIG. 3C shows results obtained using 5 pmol EndoMS.

This data shows that EndoMS is very effective at cleaving the adapter dimers produced during next generation sequencing library construction.

FIG. 4 schematically illustrates how EndoMS can be used to eliminate adapter dimers during NGS library construction. In this example, after adapter ligation, the DNA is purified from DNA ligase and then treated with EndoMS at 37° C. for 30 minutes in NEBNext® High-Fidelity 2×PCR Master Mix (New England Biolabs, Ipswich, Mass.). Adapter dimers are cleaved and are therefore not used as a substrate for PCR. Then PCR cycling is initiated to amplify the libraries.

FIG. 5 shows how a mismatch oligonucleotide can target DNA cleavage using EndoMS. A mismatch oligonucleotide containing at least one T:T or U:U mismatch is hybridized to a target DNA. EndoMS cleaves at the T:T or U:U mismatch creating a dsDNA break.

FIG. 6 shows a method for detecting a mismatched nucleotide. In this method, a sample is digested with a mismatch-specific endonuclease and a sequence is amplified, e.g., by PCR. If the sequence does not contain a mismatch, then an amplification product should be obtained. If the sequence does contain a mismatch, then no amplification product should be obtained.

FIG. 7 shows an alignment of wild type EndoMS proteins.

DETAILED DESCRIPTION

Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described.

All patents and publications, including all sequences disclosed within such patents and publications, referred to herein as well as U.S. Provisional Application Ser. No. 62/525,803, filed Jun. 28, 2017 are expressly incorporated by reference.

Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.

The headings provided herein are not limitations of the various aspects or embodiments of the invention. Accordingly, the terms defined immediately below are more fully defined by reference to the specification as a whole.

Singleton, et al., DICTIONARY OF MICROBIOLOGY AND MOLECULAR BIOLOGY, 2D ED., John Wiley and Sons, New York (1994), and Hale & Markham, THE HARPER COLLINS DICTIONARY OF BIOLOGY, Harper Perennial, N.Y. (1991) provide one of skill with the general meaning of many of the terms used herein. Still, certain terms are defined below for the sake of clarity and ease of reference.

As used herein, the term “EndoMS” refers to any member of the conserved family of endonucleases that catalyzes a double-stranded break at sites that contain a mismatched nucleotide. Examples are provided by the 104 related sequences presented herein as aligned sequences in FIG. 7 and in the sequence listings. These enzymes are considered to be members of the RecB family of nucleases. EndoMS may also be referred to as nucS. The term “EndoMS” used herein therefore includes any of the known wild-type EndoMS family members. These enzymes may have a naturally-occurring amino acid sequence or may be variants of the naturally occurring proteins. For example, in certain embodiments, variants may have at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 98%, or at least 99% amino acid sequence identity with the naturally occurring amino acid sequence of an EndoMS family member, for example, a single EndoMS selected from the 104 sequences in FIG. 7.

The amino acid sequences of EndoMS from some bacteria and some archaea are conserved as shown in FIG. 7. For example, this enzyme family contains characteristic motifs such as a GxhDxh motif, a hxffEhK motif, and a hxxYxxhh motif (in which h is a hydrophobic residue and x is any amino acid), as described in Ren, et al. 2009 EMBO J., 28, 2479-2489 and Nakai, 2016 Structure 24: 11 1960-1971. The active site (Y171) and three DNA binding site residues (R42, R70 and W75) are essential in family member Pyrococcus abyssi NucS. Corresponding residues D165A, E179A, and K181A were found to be essential in the Thermococcus kodakarensis EndoMS enzyme (see, for example, Ishino, 2016 Nucl. Acids Res. 44: 2977-2986 and Ariyoshi, et al. 2016 Structure, 24(11), 1859-1861).

It can be seen from the sequence alignments shown in FIG. 7 that K at position 181 is conserved in 100% of the 104 EndoMS proteins analyzed. D165 is conserved in 103/104 EndoMS sequences and E179 is conserved in 100% of the 104 EndoMS sequences.

This corresponds to XXXXXX in the sequence listings for SEQ ID NOs:1-104.

Additionally, in some embodiments, EndoMS may be characterized by P72 V (or I) 73 N74 W75 Q76 P (or A) 77 and P (or S) 78.

Additionally, in some embodiments, EndoMS may be characterized by one or more amino acids at the following positions in the sequence alignment: P9, C34, Y38, G40, S44, L48, G50, K57, D59, G60, H65, P100, E102, D121, G129, E131, P143, F150, E156, G162, G168, D170, E179, K181, A189, V190, Q192, Y196, R219, G220, V222, P224, A231, E240 in which the actual position can be determined from the sequence listing.

These positions of the amino acids are derived from the alignment provided in FIG. 7.

The phylogeny and structure of this family has been studied (see Ishino, supra). Examples of wild type EndoMS family members include, but are not limited to: Thermococcus kodakarensis (encoded by TK1898), Pyrococcus furiosus (encoded by PF0012), Pyrococcus abyssi (encoded by PAB2263), Halobacterium sp. NRC-1 (encoded by VNG0171C), Methanocaldococcus jannaschii (encoded by MJ_0225), Methanocella paludicola SANAE (encoded by MCP_1445), Methanobacterium thermoautotrophicum (encoded by MTH1816), Methanopyrus kandleri (encoded by MK0507), Sulfolobus solfataricus (encoded by SS02208), Pyrobaculum calidifontis (encoded by JCM 11548, Pcal_0508), Ignisphaera aggregans (encoded by DSM 17230, Igag_1168), Aeropyrum pernix (encoded by APE_0957), Candidatus Caldiarchaeum subterraneum (encoded by CSUB_C1217), Thaumarchaeota archaeon (encoded by SCGC, AB-539-E09), Streptomyces cattleya (encoded by NRRL DSM 46488, SCAT_4205), Rhodococcus jostii (encoded by RHA1, RHA1_ro01459), Mycobacterium colombiense (encoded by CECT 3035, MCOL_04050) and Actinomyces urogenitalis (encoded by DSM 15434, HMPREF0058_1558). The sequences of these proteins are described in Ishino, Nucl. Acids Res. 2016, 44: 2977-2986, and are incorporated by reference herein.

As used herein, the term “T-tailed double-stranded adapter” refers to a double-stranded adapter that contains an end that has an overhang of a single T nucleotide (as described herein). A double-stranded adapter may be 20 to 150 bases in length, e.g., 40 to 120 bases or 50-80 bases; or shorter such as 10-40 or 20-30 bases. A double-stranded adapter may contain one or more single-stranded regions in addition to a double-stranded region that is tailed with a T.

As used herein, the nucleotide “T” is intended to include T as well as analogs of T (including U, modified T and modified U) that are still capable of base pairing with an A. Such modifications include modifications to the base and/or the sugar. Examples of modified Ts include 2-Thiothymidine-5′-triphosphate, 4-Thiothymidine-5′-Triphosphate, or 2′-Deoxythymidine-5′-O-(1-Thiotriphosphate) (TriLink Biotechnologies, San Diego, Calif.). Corresponding modifications of U may also be found and/or synthesized.

The term “adapter-ligated,” as used herein, refers to a nucleic acid that has been tagged by, i.e., covalently linked with, an adapter. An adapter can be joined to a 5′ end and/or a 3′ end of a nucleic acid molecule.

The term “Y-adapter” refers to an adapter that contains a double-stranded region, and a single-stranded non-complementary region. The adapter is designed to have a T base overhang at the 3′ end of the double stranded portion of the adapter DNA. The end of the double-stranded region can be joined to target nucleic acids such as double-stranded fragments of genomic DNA, e.g., by ligation where the 3′ end of the target nucleic acid has a single base overhang which is an A. The addition of an A is a byproduct of replication of the DNA by Taq polymerase.

Each strand of an adapter-ligated target nucleic acid that has been ligated to a Y-adapter is asymmetrically tagged in that it has the sequence of one strand of the Y-adapter at one end and the other strand of the Y-adapter at the other end. Amplification of target nucleic acids that have been joined to Y-adapters at both ends results in an asymmetrically ligated nucleic acid, i.e., a nucleic acid that has a 5′ end containing one adapter sequence and a 3′ end that has another adapter sequence.

The terms “hairpin adapter” and “loop adapter” refer to an adapter that is in the form of a hairpin. In one embodiment, after ligation of a “hairpin adapter”, the hairpin loop can be cleaved to produce strands that have non-complementary tags on the ends. In some cases, the loop of a hairpin adapter may contain a uracil residue, and the loop can be cleaved using uracil DNA glycosylase and endonuclease VIII, although other methods are known.

As used herein, the term “A-tailed double-stranded nucleic acid fragments” refers to a population of double-stranded nucleic acid fragments that have an overhang of a single A at each end. Such fragments are commonly made from fragmented DNA that has been polished and then tailed by Taq polymerase.

As used herein, the term “double-stranded adapter dimers” refers to the product of ligation between two double-stranded adapter molecules. An example of an adapter dimer is shown in FIG. 2A revealing a T-T mismatch. These “adapter dimers” can result from self-ligation of the adapters directly to each other rather than to a library insert sequence. Library conversion efficiency is the fraction of ligated inserts relative to the total of unligated inserts, ligated inserts and adapter dimers (Library conversion efficiency=ligated inserts/(ligated inserts+unligated inserts+adapter dimers)). High library conversion efficiency is the result of all library inserts ligated to adapters with minimal adapter dimers. Alternatively, low library conversion efficiency is the result of low insert ligation efficiency and/or high efficiency adapter dimer ligation.
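The definition of library conversion efficiency given above can be expressed as a short calculation. The following is a minimal sketch. The function name and the example counts are hypothetical, and the three quantities are assumed to be expressed in the same units (for example, numbers of molecules).

```python
def library_conversion_efficiency(ligated_inserts: float,
                                  unligated_inserts: float,
                                  adapter_dimers: float) -> float:
    """Return ligated inserts / (ligated inserts + unligated inserts + adapter dimers)."""
    if min(ligated_inserts, unligated_inserts, adapter_dimers) < 0:
        raise ValueError("amounts must be non-negative")
    total = ligated_inserts + unligated_inserts + adapter_dimers
    if total == 0:
        raise ValueError("at least one amount must be greater than zero")
    return ligated_inserts / total


if __name__ == "__main__":
    # Hypothetical counts: the same inserts, with and without adapter dimers present.
    print(library_conversion_efficiency(80, 20, 100))
    print(library_conversion_efficiency(80, 20, 0))
```

Because adapter dimers appear only in the denominator, removing them raises the computed efficiency without any change in the number of ligated inserts.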

As used herein, the term “mismatched nucleotide” or “single nucleotide polymorphism” refers to a pair of nucleotides that oppose each other but are not complementary in a double-stranded nucleic acid. Examples of mismatched nucleotides include T:T, U:U, A:A, C:C, G:G, T:G, T:C, A:G, and A:C. In one embodiment, the mismatched nucleotide is a T:T mismatch. By way of example, a T:T mismatch may form by ligation between two double-stranded adapter molecules each having a 3′ overhang of a single T nucleotide.

As used herein, the term “mismatch at the ligation junction” refers to a pair of opposing nucleotides that are not complementary that are each present adjacent to a phosphodiester bond produced in a ligation. In one embodiment, the pair of opposing nucleotides are both T nucleotides, resulting in a T:T mismatch at the ligation junction. This embodiment includes the pair of opposing nucleotides being U nucleotides, giving rise to a U:U mismatch at the ligation junction.
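The definition of a mismatched nucleotide as a pair of opposing, non-complementary nucleotides can be illustrated computationally. The following is a minimal sketch. The function name and the example sequences are hypothetical. It assumes two strands of equal length with no gaps, the first written 5′ to 3′ and the second written 3′ to 5′ so that opposing nucleotides share an index, and it treats U as T for pairing purposes.

```python
# Watson-Crick complements; U is normalized to T before lookup.
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def find_mismatches(top_5to3: str, bottom_3to5: str) -> list[tuple[int, str]]:
    """Return (index, 'X:Y') for every opposing pair that is not complementary."""
    if len(top_5to3) != len(bottom_3to5):
        raise ValueError("strands must have the same length")
    mismatches = []
    for i, (a, b) in enumerate(zip(top_5to3.upper(), bottom_3to5.upper())):
        a_norm = "T" if a == "U" else a
        b_norm = "T" if b == "U" else b
        if a_norm not in _COMPLEMENT or b_norm not in _COMPLEMENT:
            raise ValueError(f"unsupported nucleotide at position {i}")
        if _COMPLEMENT[a_norm] != b_norm:
            mismatches.append((i, f"{a}:{b}"))
    return mismatches


if __name__ == "__main__":
    # Hypothetical duplex with a single T:T pair at index 4.
    print(find_mismatches("ACGTTGCA", "TGCATCGT"))
```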

As used herein, the term “thermostable” refers to an enzyme that has optimal activity at a temperature of at least 50° C.

As used herein, the term “not perfectly complementary” refers to two sequences that are sufficiently complementary to allow the two sequences to hybridize to one another under high stringency conditions to produce a duplex, but wherein the duplex formed by hybridization of the two sequences comprises at least one mismatch, e.g., a single mismatch. Two sequences that are “not perfectly complementary” are therefore less than 100% complementary (but may be at least 90%, 95%, 98% or 99% complementary).

As used herein, the term “without enriching for the adapter-ligated fragments by size” refers to a method in which there is no step that selects for adapter-ligated fragments by their size, e.g., using a product designed to purify PCR products from adapters, such as a QIAquick® (Qiagen, Germantown, Md.) or SPRI column.

As used herein, the term “nucleic acid” refers to RNA or DNA and includes RNA or DNA oligonucleotides, exons, introns, other non coding DNA, unspecified genomic DNA, whole genomes, cellular RNA species such as non coding RNA, miRNA, mRNA, tRNA, rRNA etc. The nucleic acid may be short (less than 100 nt), medium (100 nt-100 kb) or long in length (greater than 100 kb) and may comprise or consist of target sequences.

In some embodiments, the method may comprise ligating a double-stranded portion of an adapter with a T-tail as defined herein at the 3′ end of one strand to double-stranded fragments of nucleic acid having an A-tail at the 3′ end to produce ligation products that comprise double-stranded nucleic acid fragments having adapters ligated at each end and double-stranded adapter dimers that comprise a T:T mismatch at the ligation junction, and cleaving both strands of the adapter dimers using EndoMS.

The A-tailed double-stranded fragments of nucleic acid may be made by extracting nucleic acid (e.g. DNA) from an initial sample, optionally fragmenting the nucleic acid (e.g. DNA), polishing the ends of the nucleic acid fragments (using, e.g., T4 DNA polymerase) and A-tailing the polished fragments (using, e.g., Taq polymerase). In some embodiments, the initial sample may contain intact double-stranded nucleic acids (e.g. dsDNA). In these embodiments, the sample may be fragmented before it is A-tailed. In these embodiments, fragmenting may be done mechanically (e.g., by sonication, nebulization, or shearing, etc.) or using a double-stranded DNA “dsDNA” Fragmentase® enzyme (New England Biolabs, Ipswich Mass.). In other embodiments, the nucleic acid (e.g. DNA) in the sample may already be fragmented (e.g., as is the case for FPET samples and circulating cell-free DNA (cfDNA), e.g., ctDNA).

The T-tailed adapters are synthetic oligonucleotides that form Y-adaptors or loop adaptors both of which contain a double stranded region and a single stranded region and both have the T-tail at the 3′ terminus of the double stranded region.

In some embodiments, the A-tailed double-stranded nucleic acid fragments may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, 80 bp to 400 bp, or 200-400 bp), although fragments having a median size outside of this range may be used. In some embodiments, the amount of nucleic acid (e.g. DNA) in a sample may be limiting. For example, the sample of fragmented DNA may contain less than 200 ng of fragmented human DNA, e.g., 1 pg to 20 pg, 10 pg to 200 ng, 100 pg to 200 ng, 1 ng to 200 ng or 5 ng to 50 ng, or less than 10,000 (e.g., less than 5,000, less than 1,000, less than 500, less than 100, less than 10 or less than 1) haploid genome equivalents, depending on the genome.
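The relationship between an input mass and a number of haploid genome equivalents can be sketched as follows. This is an illustration only. It assumes a mass of roughly 3.3 pg per haploid human genome, which is an approximate figure that is not stated in this specification, and the function name is hypothetical. Other genomes require a different per-genome mass.

```python
# Assumption: approximate mass of one haploid human genome, in picograms.
HUMAN_HAPLOID_GENOME_PG = 3.3


def haploid_genome_equivalents(mass_pg: float,
                               pg_per_genome: float = HUMAN_HAPLOID_GENOME_PG) -> float:
    """Estimate how many haploid genome copies a given DNA mass represents."""
    if mass_pg < 0 or pg_per_genome <= 0:
        raise ValueError("mass must be >= 0 and per-genome mass must be > 0")
    return mass_pg / pg_per_genome


if __name__ == "__main__":
    # Input amounts taken from the ranges mentioned in the text, converted to pg.
    for label, mass_pg in (("1 pg", 1.0), ("1 ng", 1_000.0), ("200 ng", 200_000.0)):
        print(f"{label}: ~{haploid_genome_equivalents(mass_pg):.1f} haploid genome equivalents")
```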

As would be apparent, the ligation step may be done using any suitable ligase including, but not limited to T4 DNA ligase.

As noted above, the EndoMS treatment not only removes adapter dimers from the reaction, but potentially also double-stranded nucleic acid molecules (e.g. dsDNA molecules) that contain single nucleotide mismatches. Single nucleotide mismatches may result from damaged nucleotides that are read by a polymerase as a different nucleotide and, as such, damaged nucleotides can confound the results obtained by sequencing those fragments. For example, deaminated cytosines and oxidized guanines both base pair with adenine, which leads to erroneous base calls after amplification. It has been found that damaged nucleotides are a pervasive cause of sequencing errors and this, in turn, confounds variant identification (see, e.g., Chen, et al., Science 2017 355:752-756). In these cases, a “damaged nucleotide” is any derivative of adenine, cytosine, guanine, and thymine that has been altered in a way that allows it to pair with a different base. In non-damaged DNA, A base pairs with T and C base pairs with G. However, some bases can be oxidized, alkylated or deaminated in a way that affects base pairing. For example, 7,8-dihydro-8-oxoguanine (8-oxo-dG) is a derivative of guanine that base pairs with adenine instead of cytosine. This derivative causes a G to T transversion after replication. Deamination of cytosine produces uracil, which can base pair with adenine, leading to a C to T change after replication. Other examples of damaged nucleotides that are capable of mismatched pairing are known. Removal of molecules that contain damaged nucleotides can provide more reliable sequencing data.

In some embodiments, the method may comprise amplifying the adapter-ligated double-stranded nucleic acid fragments after the reaction has been treated with EndoMS. In these embodiments, the amplification may be done by PCR (e.g., using a first primer that hybridizes to an adapter sequence and another primer that hybridizes to the complement of an adapter sequence). As would be apparent, the primers used for amplification and/or the adapters may be compatible with use in any next generation sequencing platform, e.g., Illumina's reversible terminator method, Roche's pyrosequencing method (454), Life Technologies' sequencing by ligation (the SOLID® platform), Life Technologies' Ion Torrent platform or Pacific Biosciences' fluorescent base-cleavage method, etc. The fragments may be sequenced without amplification, or after they are amplified. Examples of such methods are described in the following references: Margulies, et al., Nature, 2005, 437:376-80; Ronaghi, et al., Analytical Biochemistry, 1996, 242:84-9; Shendure, Science, 2005, 309:1728; Imelfort, et al., Brief Bioinform. 2009, 10:609-18; Fox, et al., Methods Mol Biol. 2009, 553:79-108; Appleby, et al., Methods Mol Biol. 2009, 513:19-39; English, PLoS One. 2012 7:e47768; and Morozova, Genomics, 2008, 92:255-64, which are incorporated by reference for the general descriptions of the methods and the particular steps of the methods, including all starting products, reagents, and final products for each of the steps. The sequencing may be done by paired-end sequencing, although single read sequencing can be done in some cases.

Because the adapter dimers can be efficiently removed using EndoMS, there is no need to remove the adapter dimers by size separation. As such, in some embodiments, the method may be done without enriching for the adapter-ligated double-stranded nucleic acid fragments by size. For example, there is no need to perform a size separation after ligation but before amplification, after amplification and before sequencing, or after ligation and before sequencing (if the sample is not amplified beforehand).

In some embodiments, the ligation and EndoMS steps may be done in the same vessel. In these embodiments, the method may be done by incubating a reaction mix comprising the T-tailed double-stranded adapter, the A-tailed double-stranded fragments of nucleic acid, ligase, and the EndoMS, to produce the ligation products and cleave both strands of the adapter dimers. In some embodiments, the EndoMS may be thermostable. In these embodiments, the ligation reaction may be terminated and the EndoMS may be activated by changing the temperature of the reaction mix. For example, the ligation step may be performed at a temperature of between 15° C. and 25° C., and the EndoMS treatment step may be done at a temperature that is at least 10° C., at least 20° C., or at least 30° C. higher than the ligation step (e.g., at a temperature of at least 35° C., at least 45° C., or at least 55° C.). Alternatively, the EndoMS may be mesophilic.
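The temperature relationship in the example above (ligation between 15° C. and 25° C., followed by an EndoMS step at least 10° C. higher) can be written as a simple consistency check on a two-step thermal profile. The following is a minimal sketch. The class and function names are hypothetical, and the check encodes only the example ranges given above and is not a requirement of the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnePotProfile:
    """Hypothetical two-step thermal profile for a single-vessel reaction."""
    ligation_temp_c: float
    endoms_temp_c: float


def matches_example_ranges(profile: OnePotProfile, min_shift_c: float = 10.0) -> bool:
    """Check the profile against the example ranges given in the text."""
    ligation_ok = 15.0 <= profile.ligation_temp_c <= 25.0
    shift_ok = (profile.endoms_temp_c - profile.ligation_temp_c) >= min_shift_c
    return ligation_ok and shift_ok


if __name__ == "__main__":
    print(matches_example_ranges(OnePotProfile(20.0, 55.0)))
    print(matches_example_ranges(OnePotProfile(20.0, 25.0)))
```

The `min_shift_c` parameter can be set to 20 or 30 to reflect the larger temperature shifts mentioned in the same example.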

Also provided herein is a method for cleaving a nucleic acid (such as DNA). In these embodiments, the method may comprise hybridizing the nucleic acid with an oligonucleotide that is not perfectly complementary to a target sequence to produce a duplex that comprises one or more mismatches (e.g., a single mismatch), and treating the duplex with EndoMS, thereby cleaving the nucleic acid at the target sequence. In these embodiments, the nucleic acid (e.g. DNA) may be single-stranded (e.g., which can be made by denaturing a sample that comprises dsDNA). Alternatively, the nucleic acid (e.g. DNA) may be double-stranded, and the hybridizing is done by strand invasion. This can be done by, e.g., incubating the reaction at a temperature that is insufficient to denature the nucleic acid in the sample but sufficient to allow strand invasion, such as a temperature in the range of 37° C. to 80° C. Strand invasion can be facilitated by single-stranded DNA binding proteins (SSBPs).

The oligonucleotide used in this method may be of any suitable length, e.g., between 15 and 100 nucleotides in length. In some embodiments, in the duplex, the oligonucleotide and nucleic acid comprise at least 8 nucleotides of perfect complementarity (at least 10, at least 15, or at least 20 nucleotides of perfect complementarity) on either side of a mismatch.
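The design constraints described above (an oligonucleotide of 15 to 100 nucleotides, with at least 8 nucleotides of perfect complementarity on either side of a mismatch) can be checked programmatically. The following is a minimal sketch. The function names and the example sequences are hypothetical. It assumes an ungapped duplex in which the oligonucleotide is written 5′ to 3′ and the target region is written 3′ to 5′, and it treats U as T.

```python
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def _is_paired(a: str, b: str) -> bool:
    """True if the two opposing nucleotides are Watson-Crick complementary."""
    a = "T" if a == "U" else a
    b = "T" if b == "U" else b
    return _COMPLEMENT.get(a) == b


def oligo_design_ok(oligo_5to3: str, target_3to5: str, min_flank: int = 8,
                    min_len: int = 15, max_len: int = 100) -> bool:
    """Check length limits and perfectly paired flanks around every mismatch."""
    if len(oligo_5to3) != len(target_3to5):
        raise ValueError("oligonucleotide and target region must have the same length")
    if not (min_len <= len(oligo_5to3) <= max_len):
        return False
    paired = [_is_paired(a, b) for a, b in zip(oligo_5to3.upper(), target_3to5.upper())]
    mismatch_positions = [i for i, ok in enumerate(paired) if not ok]
    if not mismatch_positions:
        return False  # a perfectly complementary oligonucleotide creates no mismatch
    for i in mismatch_positions:
        left = 0
        j = i - 1
        while j >= 0 and paired[j]:  # count paired bases immediately 5' of the mismatch
            left += 1
            j -= 1
        right = 0
        j = i + 1
        while j < len(paired) and paired[j]:  # count paired bases immediately 3'
            right += 1
            j += 1
        if left < min_flank or right < min_flank:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical 19-nt oligonucleotide with one T:T pair flanked by 9 paired bases.
    oligo = "ACGTACGTA" + "T" + "CGTACGTAC"
    target = "TGCATGCAT" + "T" + "GCATGCATG"
    print(oligo_design_ok(oligo, target))
```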

EndoMS can also be used to identify single nucleotide mismatches in a double-stranded nucleic acid such as any dsDNA. In some embodiments, this method may comprise: (a) reacting a sample comprising the double-stranded nucleic acid (e.g. dsDNA) with EndoMS to produce a reaction product, wherein the EndoMS cleaves both strands of the double-stranded nucleic acid only if it contains a mismatch; (b) subjecting the reaction product of (a) to amplification under conditions that amplify the double-stranded nucleic acid if it is uncleaved; and (c) detecting whether an amplification product is present, wherein the presence of the product indicates that the double-stranded nucleic acid does not have a mismatched nucleotide, and the absence of a product indicates that the double-stranded nucleic acid has a mismatched nucleotide. As would be apparent, step (b) can be done by PCR, using primers that flank the mismatch.

In this method, the double-stranded nucleic acid (e.g. dsDNA) may be from any source, or from a mixture of two or more different sources. In some embodiments, the mismatch may be at a ligation junction (e.g., at the ligation junction of an adapter dimer, as discussed above). In other embodiments, the dsDNA may be genomic DNA. In these embodiments, the mismatch may be caused by DNA damage. For example, the dsDNA may contain a damaged nucleotide. In other embodiments, the dsDNA may be a PCR product or double-stranded cDNA. In these embodiments, the mismatch may be caused by mis-incorporation of a nucleotide. In some embodiments, the double-stranded nucleic acid (e.g. dsDNA) may comprise two strands (e.g., a first strand and a second strand that may comprise one or more nucleotide substitutions relative to the first strand) that have been hybridized together. In one embodiment, the method comprises the initial step of hybridizing together a first nucleic acid strand and a second nucleic acid strand, wherein the first and second strands are not perfectly complementary (i.e. to form a duplex comprising one or more mismatches). In this embodiment, one of the strands in a duplex may contain the complement of a single nucleotide polymorphism relative to the other strand in the duplex.

In this method, the presence of an amplification product can be detected by gel or capillary electrophoresis, for example, although other methods that can separate DNA molecules by size can be used.

The detection may be qualitative or quantitative. In some embodiments, the results may be compared to a control. As such, the detecting step may comprise quantifying the amount of the amplification product.
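The quantitative comparison to a control can be illustrated with a short sketch. The readout logic follows the method described above (an amplification product indicates no mismatch; loss of product indicates a mismatch), but the function name `interpret_amplification` and the 0.5 cutoff are hypothetical choices made for illustration only; a suitable cutoff would need to be established empirically for a given assay.

```python
def interpret_amplification(treated_signal, control_signal, present_fraction=0.5):
    """Compare an EndoMS-treated reaction to an untreated control.

    treated_signal, control_signal: amplification product amounts in the same
    arbitrary units (e.g., peak area from capillary electrophoresis).
    present_fraction: ASSUMED cutoff; at or above it the product is scored
    as present.
    Returns (fraction of product remaining, call).
    """
    if control_signal <= 0:
        raise ValueError("control reaction must yield a measurable product")
    remaining = treated_signal / control_signal
    if remaining >= present_fraction:
        call = "no mismatch detected"   # template survived EndoMS treatment
    else:
        call = "mismatch detected"      # template was cleaved before amplification
    return remaining, call
```

For example, a treated reaction that retains 95 units of product against a control of 100 units would be scored as having no mismatch, whereas one that retains 5 units would be scored as containing a mismatch.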

The methods described above can be employed to analyze genomic DNA from virtually any organism, including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ancient samples, etc. In certain embodiments, the genomic DNA used in the method may be derived from a mammal, wherein in certain embodiments the mammal is a human. In exemplary embodiments, the sample may contain genomic DNA from a mammalian cell, such as a human, mouse, rat, or monkey cell. The sample may be made from cultured cells or cells of a clinical sample, e.g., a tissue biopsy, scrape or lavage or cells of a forensic sample (i.e., cells of a sample collected at a crime scene). In particular embodiments, the nucleic acid sample may be obtained from a biological sample such as cells, tissues, bodily fluids, and stool. Bodily fluids of interest include but are not limited to, blood, serum, plasma, saliva, mucous, phlegm, cerebral spinal fluid, pleural fluid, tears, lactal duct fluid, lymph, sputum, synovial fluid, urine, amniotic fluid, and semen. In particular embodiments, a sample may be obtained from a subject, e.g., a human. In some embodiments, the sample comprises fragments of human genomic DNA. In some embodiments, the sample may be obtained from a cancer patient. In some embodiments, the sample may be made by extracting fragmented DNA from a patient sample, e.g., a formalin-fixed paraffin embedded tissue sample. In some embodiments, the patient sample may be a sample of cell-free “circulating” DNA from a bodily fluid, e.g., peripheral blood, e.g., from the blood of a patient or of a pregnant female. The DNA fragments used in the initial step of the method should be non-amplified DNA that has not been denatured beforehand.

Kits

Also provided by this disclosure is a kit for practicing the subject method, as described above. For example, in some embodiments, the kit may comprise a T-tailed double-stranded adapter and an EndoMS enzyme. The kit may also contain a ligase (e.g., T4 DNA ligase), a reaction buffer (which may be in concentrated form) and/or reagents for A-tailing fragments (e.g., T4 DNA polymerase and Taq polymerase), etc. The various components of the kit may be present in separate containers or certain compatible components may be pre-combined into a single container, as desired.

In addition to above-mentioned components, the subject kits may further include instructions for using the components of the kit to practice the subject methods, i.e., to provide instructions for sample analysis. The instructions for practicing the subject methods are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

EMBODIMENTS

Embodiment 1. A method for reducing adapter dimers, comprising:

(a) ligating a T-tailed double-stranded adapter to an A-tailed double-stranded fragment of nucleic acid to produce ligation products that comprise adapter-ligated fragments and double-stranded adapter dimers that comprise a T:T mismatch at the ligation junction; and
(b) cleaving both strands of the adapter dimers using EndoMS.

Embodiment 2. The method of embodiment 1, further comprising (c) amplifying the adapter-ligated fragments.

Embodiment 3. The method of embodiment 2, wherein the amplifying is done using primers that hybridize to the adapter, or complement thereof.

Embodiment 4. The method of embodiments 2 or 3, wherein the method is done without enriching for the adapter-ligated double stranded nucleic acid fragments by size.

Embodiment 5. The method of any prior embodiment, wherein the fragments are genomic fragments.

Embodiment 6. The method of any prior embodiment, wherein the T-tailed adapter is a Y adapter.

Embodiment 7. The method of any prior embodiment, wherein the T-tailed adapter is a loop adapter.

Embodiment 8. The method of any prior embodiment, wherein the method comprises incubating a reaction mix comprising the T-tailed double-stranded adapter, the A-tailed double-stranded fragments of nucleic acid, a ligase, and the EndoMS to produce the ligation products and cleave both strands of the adapter dimers.

Embodiment 9. The method of any prior embodiment, wherein the EndoMS is thermostable.

Embodiment 10. A kit comprising:

(a) a T-tailed double-stranded adapter; and
(b) an EndoMS enzyme.

Embodiment 11. The kit of embodiment 10, wherein the kit further comprises a ligase.

Embodiment 12. The kit of any of embodiments 10-11, wherein the kit further comprises a reaction buffer.

Embodiment 13. A method, comprising:

(a) hybridizing a first single stranded nucleic acid with a second single stranded nucleic acid that is not perfectly complementary to a target sequence in the first nucleic acid to produce a duplex nucleic acid that comprises one or more mismatches; and
(b) treating the duplex nucleic acid with EndoMS, so as to cleave the duplex nucleic acid at the target sequence.

Embodiment 14. The method of embodiment 13, wherein the second single stranded nucleic acid is an oligonucleotide.

Embodiment 15. The method of embodiments 13 or 14, wherein the first single stranded nucleic acid is denatured dsDNA.

Embodiment 16. The method of any of embodiments 13-15, wherein the hybridizing is done by strand invasion.

Embodiment 17. The method of any of embodiments 13-16, wherein, in the duplex, the first and second nucleic acids comprise at least 8 nucleotides of perfect complementarity on each side of a mismatch.

Embodiment 18. A method for identifying a mismatched nucleotide in a double-stranded nucleic acid, comprising:

(a) reacting a sample comprising the nucleic acid with EndoMS to produce a reaction product, wherein the EndoMS cleaves both strands of the nucleic acid only if it contains a mismatch;
(b) subjecting the reaction product of (a) to amplification under conditions that amplify the nucleic acid if it is uncleaved; and
(c) detecting the presence of an amplification product, wherein the presence of the product indicates that the double-stranded nucleic acid does not have a mismatched nucleotide and the absence of a product indicates that the double-stranded nucleic acid has a mismatched nucleotide.

Embodiment 19. The method of embodiment 18, wherein step (b) is done by PCR, using primers that flank the mismatch.

Embodiment 20. The method of any of embodiments 18-19, wherein the mismatch is at a ligation junction.

Embodiment 21. The method of any of embodiments 18-20, wherein the nucleic acid is an adapter dimer.

Embodiment 22. The method of any of embodiments 18-21, wherein the double-stranded nucleic acid is genomic DNA.

Embodiment 23. The method of any of embodiments 18-22, wherein the mismatch is caused by DNA damage.

Embodiment 24. The method of any of embodiments 18-23, wherein the double-stranded nucleic acid comprises two strands that have been hybridized together.

Embodiment 25. The method of any of embodiments 18-24, wherein the detecting step (c) comprises quantifying the amount of the amplification product.

Embodiment 26. A method for identifying a mismatched nucleotide in a double-stranded nucleic acid in vivo, comprising:

(a) reacting a nucleic acid in a cell sample with EndoMS to produce a reaction product, wherein the EndoMS cleaves both strands of the nucleic acid only if it contains a mismatch; and wherein the EndoMS is (i) expressed by an extrachromosomal DNA introduced in the cell; or (ii) introduced into the cell by a liposome or transport agent;
(b) detecting cleavage of the nucleic acid in the cell.

EXAMPLES

Aspects of the present teachings can be further understood in light of the following examples, which should not be construed as limiting the scope of the present teachings in any way.

Example 1: EndoMS Treatment to Eliminate Oligonucleotides Having a Mismatch

To test the ability of EndoMS variants to cleave mismatched oligonucleotides, substrates were designed with perfect base pairing (T:A) or with a T:T mismatch (T:T) to model mismatch oligonucleotide assembly used in synthetic biology gene assembly.

An oligonucleotide was labeled by VIC at its 5′ end (VIC-CGCCAGGGTTTTCCCAGTCACGAC) (SEQ ID NO:105). The labeled oligonucleotide was annealed to the following oligonucleotides to form double-stranded substrates having either a T:A match (GTCGTGACTGGGAAAACCCTGGCG) (SEQ ID NO:106) or T:T mismatch (GTCGTGACTGGGTAAACCCTGGCG) (SEQ ID NO:107) (FIG. 1A). In 200 μl, EndoMS (1 μmol) was incubated with the T:A or T:T substrate (20 nM) at 37° C. for between 0 and 60 minutes in 1×NEBuffer 2 (50 mM NaCl, 10 mM Tris-HCl, 10 mM MgCl₂, 1 mM DTT, pH 7.9 at 25° C.). Aliquots (20 μl) were sampled and stopped with EDTA (50 mM final concentration) at various time points (0-60 minutes). Reaction products were separated by capillary electrophoresis using a 3730xl Genetic Analyzer (Applied Biosystems, Foster City, Calif.), and fluorescent peaks were analyzed using Peak Scanner software version 1.0 (Applied Biosystems, Foster City, Calif.) (14). The concentration of product (9 nt) was graphed as a function of time. EndoMS lacked activity on matched oligonucleotide substrates but cleaved two bp 5′ to the T:T mismatch on each strand leaving a 5′ overhang (FIG. 1B). This data shows that EndoMS can cleave both strands of a substrate that contains a T:T mismatch.
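The position of the mismatch in these substrates, and the size of the labeled product, can be worked out directly from the sequences. The sketch below is illustrative only. The sequences are those given above; the helper names are hypothetical, and the product-length calculation assumes the reading that two paired bases lie between the cleavage site and the mismatch on the labeled strand, which agrees with the 9 nt product reported.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Sequences from Example 1, written 5'->3' (label omitted).
TOP = "CGCCAGGGTTTTCCCAGTCACGAC"              # SEQ ID NO:105 (VIC-labeled strand)
BOTTOM_MATCH = "GTCGTGACTGGGAAAACCCTGGCG"     # SEQ ID NO:106
BOTTOM_MISMATCH = "GTCGTGACTGGGTAAACCCTGGCG"  # SEQ ID NO:107


def facing_pairs(top, bottom):
    """Pair each top-strand base with the bottom-strand base opposite it."""
    # The strands are antiparallel, so the bottom strand is read in reverse.
    return list(zip(top, reversed(bottom)))


def mismatches(top, bottom):
    """List (0-based top-strand index, 'X:Y') for every non-Watson-Crick pair."""
    return [(i, f"{a}:{b}")
            for i, (a, b) in enumerate(facing_pairs(top, bottom))
            if COMPLEMENT[a] != b]


def labeled_product_length(mismatch_index, paired_bases_between=2):
    """Length of the 5'-labeled fragment released from the top strand.

    ASSUMPTION: the cut leaves `paired_bases_between` paired bases between
    the cleavage site and the mismatch.
    """
    return mismatch_index - paired_bases_between


print(mismatches(TOP, BOTTOM_MATCH))
print(mismatches(TOP, BOTTOM_MISMATCH))
print(labeled_product_length(mismatches(TOP, BOTTOM_MISMATCH)[0][0]))
```

Under these assumptions, the matched substrate yields no mismatches, the mismatched substrate yields a single T:T pair at the twelfth base of the labeled strand, and the predicted labeled product is 9 nt long.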

Example 2: Adapters that Contain a 3′ T or 3′U Overhang can Ligate Together to Form Dimers

Synthetic adapter oligonucleotide 1 pATCTGATCGGAAGAGCACACGTCTGAACTCCAGTCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGATCGGA (SEQ ID NO:108) and Synthetic Adapter oligonucleotide 2 p-AGAGCACACGTCTGAACTCCAGTCTACACTCTTTCCCTACACGACGCTCTTCCG (SEQ ID NO:109) are loop adapters. In each adapter, the 5′ end of the adapter is complementary to the 3′ end of the adapter, and the adapters form a stem-loop when they are annealed. The adapters, when they are in the hairpin structure, both contain a single nucleotide 3′ T overhang. Molecules that contain a single nucleotide 3′ T overhang should, in theory, only ligate to other molecules that contain a single nucleotide 3′ A overhang, not molecules that contain a single nucleotide 3′ T overhang.

The adapters were annealed and ligated to each other using T4 DNA ligase to produce a loop adapter dimer that contains a T:T mismatch at the adapter dimer junction (as shown in FIG. 2A). To demonstrate EndoMS cleavage of adapter dimers, in a 20 μl reaction, the loop adapter dimer (40 nM) was incubated with EndoMS (1 μmol) in 1×NEBNext Ultra™ II Q5® Master Mix (New England Biolabs, Ipswich, Mass.) for 1 hour at 37° C. Reaction products were analyzed by gel electrophoresis (FIG. 2B).

Intact adapter dimer runs at 130 nt by gel electrophoresis. Cleavage with EndoMS at the T:T mismatch will yield 65 nt products as resolved by gel electrophoresis.

Example 3: Removal of Adapter Dimers from NGS Library Preparation

To test if EndoMS cleaved and decreased adapter dimers formed during next generation sequencing library construction, a next generation sequencing library was prepared according to the manufacturer's protocol (NEBNext Ultra II DNA Library Prep Kit). Briefly, human DNA was sheared into 300 nt fragments by acoustic shearing. Then, sheared DNA fragments (10 ng) were end repaired, dA-tailed and ligated to NEBNext Adapter for Illumina (15 μM). Excess small DNAs (primers) were effectively removed from the reaction mix using SPRIselect beads. Although adapters and adapter dimers are small DNAs, current separation methods such as beads generally do not completely remove adapter dimers, which are capable of binding the primers intended for amplifying target DNA.

Instead, residual adapter dimers were removed enzymatically as follows: NEBuffer 1 was added to the cleaned-up libraries, which were then split in half. Aliquots (20 μl) were treated with 0.3, 1.25 or 5 pmoles EndoMS or water and incubated for 1 hour at 37° C. Reactions were then PCR amplified for 10 cycles using Index Primer 1 and Universal Primer (as described by the manufacturer's protocol (New England Biolabs, Ipswich, Mass.)). Reaction products were separated and analyzed using the Agilent Bioanalyzer. The data for the different reactions are shown in FIG. 3A (0.3 pmoles EndoMS), FIG. 3B (1.25 pmoles EndoMS) and FIG. 3C (5 pmoles EndoMS). Results show that EndoMS treatment reduces adapter dimers.

Example 4: Mismatch Oligonucleotide Targeted dsDNA Cleavage Using EndoMS

Synthetic oligonucleotides can be designed to complement sequences in a target DNA such as genomic DNA or plasmid DNA where the design includes a single mismatch at a T to create a T:T mismatch. Alternatively, the design includes more than one mismatch at a T to create more than one T:T mismatch. EndoMS or thermostable EndoMS can then be used to cleave the duplex at the mismatch site using any of the cleavage methods described above. Briefly, the mismatch oligonucleotide (1 μM) is annealed to the target DNA (0.5 μM) in 1×NEBuffer 2 by heating to 95° C. for 5 minutes then cooling to 25° C. EndoMS is then added to cleave at mismatches. Alternatively, the mismatch oligonucleotide (1 μM) is annealed to the target DNA (0.5 μM) in 1× NEBuffer 2 and 1 pmol Thermostable EndoMS and cycled between 95° C. and 37° C. After annealing, a mismatch will form and EndoMS will cleave the heteroduplex DNA. The result of this method is a dsDNA break at a specific site directed by mismatch cleavage by EndoMS. This method provides a dsDNA cleavage reagent whose specificity is targeted by a mismatched oligonucleotide. 
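The oligonucleotide design step of this example can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed design tool: the function name `design_tt_oligo` is hypothetical, the target site is given 5'->3' on the strand to which the oligonucleotide anneals, and the oligonucleotide spans the site end to end.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def design_tt_oligo(site, t_index):
    """Return an oligo (5'->3') complementary to `site` except for one T:T pair.

    site: target-strand sequence, 5'->3'.
    t_index: 0-based position of the T in `site` that should face a T.
    """
    site = site.upper()
    if site[t_index] != "T":
        raise ValueError("the chosen position must be a T to create a T:T mismatch")
    # Base that faces each target position in a perfectly matched duplex.
    opposite = [COMPLEMENT[base] for base in site]
    # Replace the A that would pair with the chosen T by a T.
    opposite[t_index] = "T"
    # The oligo runs antiparallel to the target, so reverse to write it 5'->3'.
    return "".join(reversed(opposite))


print(design_tt_oligo("CGCCAGGGTTTTCCCAGTCACGAC", 11))
```

Applied to the labeled strand of Example 1 with the twelfth base selected, this sketch reproduces the T:T mismatch oligonucleotide of that example (SEQ ID NO:107). Designs with more than one T:T mismatch could be obtained by substituting at several T positions in the same way.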

1-9. (canceled)
 10. A kit comprising: (a) a T-tailed double-stranded adapter; and (b) an EndoMS enzyme.
 11. The kit of claim 10, wherein the kit further comprises a ligase.
 12. The kit of claim 10, wherein the kit further comprises a reaction buffer.
 13. A method, comprising: (a) hybridizing a first single stranded nucleic acid with a second single stranded nucleic acid that is not perfectly complementary to a target sequence in the first nucleic acid to produce a duplex nucleic acid that comprises one or more mismatches; and (b) treating the duplex nucleic acid with EndoMS, so as to cleave the duplex nucleic acid at the target sequence.
 14. The method of claim 13, wherein the second single stranded nucleic acid is an oligonucleotide.
 15. The method of claim 13, wherein the first single stranded nucleic acid is denatured dsDNA.
 16. The method of claim 13, wherein the hybridizing is done by strand invasion.
 17. The method of claim 13, wherein, in the duplex, the first and second nucleic acids comprise at least 8 nucleotides of perfect complementarity on each side of a mismatch.
 18. A method for identifying a mismatched nucleotide in a double-stranded nucleic acid, comprising: (a) reacting a sample comprising the double-stranded nucleic acid with EndoMS to produce a reaction product, wherein the EndoMS cleaves both strands of the double-stranded nucleic acid only if it contains a mismatch; (b) subjecting the reaction product of (a) to amplification under conditions that amplify the double-stranded nucleic acid if it is uncleaved but not if it is cleaved; and (c) detecting the presence of an amplification product, wherein the presence of the product indicates that the double-stranded nucleic acid does not have a mismatched nucleotide and the absence of a product indicates that the double-stranded nucleic acid has a mismatched nucleotide.
 19. The method of claim 18, wherein step (b) is done by PCR, using primers that flank the mismatch.
 20. The method of claim 18, wherein the mismatch is at a ligation junction.
 21. The method of claim 18, wherein the double-stranded nucleic acid is an adapter dimer.
 22. The method of claim 18, wherein the double-stranded nucleic acid is genomic DNA.
 23. The method of claim 18, wherein the detecting step (c) comprises quantifying the amount of the amplification product.
 24. A method of cleaving a double-stranded adapter dimer, the method comprising incubating the adapter dimer with EndoMS to form adapter dimer cleavage products, wherein the adapter dimer comprises at least one mismatched oligonucleotide.
 25. A method according to claim 24, wherein the cleavage products comprise products corresponding to cleavage of both strands of the adapter dimer.
 26. A method according to claim 24, wherein the adapter dimer is a loop adapter or a hairpin adapter.
 27. A method of cleaving a double-stranded nucleic acid molecule, the method comprising incubating the double-stranded nucleic acid molecule with EndoMS to form cleavage products, wherein the double-stranded nucleic acid molecule comprises at least one mismatched oligonucleotide.
 28. A method according to claim 27, wherein the cleavage products comprise products corresponding to cleavage of both strands of the double-stranded nucleic acid molecule.
 29. A method according to claim 27, wherein the double-stranded nucleic acid molecule is an adapter dimer.
 30. A method according to claim 27, wherein the adapter dimer is a loop adapter or a hairpin adapter.